
    PGLCM: Efficient Parallel Mining of Closed Frequent Gradual Itemsets

    Numerical data (e.g., DNA micro-array data, sensor data) pose a challenging problem for existing frequent pattern mining methods, which can hardly handle them. In this setting, gradual patterns have recently been proposed to extract covariations of attributes, such as "when X increases, Y decreases". Some algorithms exist for mining frequent gradual patterns, but they cannot scale to real-world databases. In this paper we present GLCM, the first algorithm for mining closed frequent gradual patterns, which offers strong complexity guarantees: the mining time is linear in the number of closed frequent gradual itemsets. Our experimental study shows that GLCM is two orders of magnitude faster than the state of the art, with constant, low memory usage. We also present PGLCM, a parallelization of GLCM that exploits multicore processors, with good scale-up properties on complex datasets. These are the first algorithms capable of mining large real-world datasets to discover gradual patterns.
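
    To make the notion concrete, below is a minimal Python sketch of a pair-based support measure for a gradual pattern such as "X increases, Y decreases". The function name, the data layout, and the pair-based support definition are illustrative assumptions; GLCM's actual support and closure computations are more involved.

```python
from itertools import combinations

def gradual_support(rows, pattern):
    """Fraction of object pairs that can be ordered to respect every
    (attribute, direction) variation in `pattern`,
    e.g. [("X", "+"), ("Y", "-")]. Illustrative pair-based definition only."""
    def respects(a, b):
        return all((a[attr] < b[attr]) if d == "+" else (a[attr] > b[attr])
                   for attr, d in pattern)

    pairs = list(combinations(rows, 2))
    ok = sum(1 for a, b in pairs if respects(a, b) or respects(b, a))
    return ok / len(pairs) if pairs else 0.0

rows = [{"X": 1, "Y": 9}, {"X": 2, "Y": 7}, {"X": 3, "Y": 4}]
print(gradual_support(rows, [("X", "+"), ("Y", "-")]))  # 1.0: X rises as Y falls
```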

    Data Mining (La fouille de données)


    LCE: An Augmented Combination of Bagging and Boosting in Python

    lcensemble is a high-performing, scalable, and user-friendly Python package for the general tasks of classification and regression. The package implements Local Cascade Ensemble (LCE), a machine learning method that further enhances the prediction performance of the current state-of-the-art methods Random Forest and XGBoost. LCE combines their strengths and adopts a complementary diversification approach to obtain a better-generalizing predictor. The package is compatible with scikit-learn, so it can interact with scikit-learn pipelines and model selection tools. It is distributed under the Apache 2.0 license, and its source code is available at https://github.com/LocalCascadeEnsemble/LCE.
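
    A minimal usage sketch of the scikit-learn-compatible interface described above. The import path and estimator name follow the package's public documentation; treat the exact constructor parameters as assumptions to check against the installed release (`pip install lcensemble`).

```python
from lce import LCEClassifier  # package name on PyPI: lcensemble
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LCEClassifier(n_jobs=-1, random_state=0)  # behaves as a scikit-learn estimator
clf.fit(X_train, y_train)                       # so it fits pipelines and CV tools
print(clf.score(X_test, y_test))
```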

    Mining XML Documents

    XML documents are becoming ubiquitous because of their rich and flexible format, which can be used for a variety of applications. Given the increasing size of XML collections as information sources, mining techniques that traditionally exist for text collections or databases need to be adapted, and new methods need to be invented, to exploit the particular structure of XML documents. XML documents can essentially be seen as trees, which are well known to be complex structures. This chapter describes various ways of using and simplifying this tree structure to model documents and support efficient mining algorithms. We focus on three mining tasks: classification and clustering, which are standard for text collections, and discovery of frequent tree structures, which is especially important for heterogeneous collections. This chapter presents some recent approaches and algorithms to support these tasks, together with experimental evaluations on a variety of large XML collections.
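
    As a small illustration of the tree view of XML documents the chapter builds on, the following standard-library sketch converts a document into a labeled tree, keeping only the structure. The (label, children) encoding is an assumption made here for illustration, not the chapter's own representation.

```python
import xml.etree.ElementTree as ET

def to_tree(elem):
    """Recursively turn an Element into a (label, children) pair,
    discarding text content to keep only the tree structure."""
    return (elem.tag, [to_tree(child) for child in elem])

doc = ET.fromstring("<movie><title/><cast><actor/><actor/></cast></movie>")
print(to_tree(doc))
# ('movie', [('title', []), ('cast', [('actor', []), ('actor', [])])])
```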

    Towards a Framework for Semantic Exploration of Frequent Patterns

    Mining frequent patterns is an essential task in discovering hidden correlations in datasets. Although frequent patterns unveil valuable information, some challenges limit their usability. First, the number of possible patterns is often very large, which hinders their effective exploration. Second, patterns with many items are hard to read, and the analyst may be unable to understand their meaning. In addition, the only available information about patterns is their support, a very coarse piece of information. In this paper, we are particularly interested in mining datasets that reflect the usage patterns of users moving in space and time and for whom demographic attributes are available (age, occupation, etc.). Such characteristics are typical of data collected from smartphones, whose analysis has critical business applications nowadays. We propose two pattern exploration primitives, abstraction and refinement, that use hand-crafted taxonomies on time, space, and user demographics. We show on two real datasets, Nokia and MovieLens, how the use of such taxonomies reduces the size of the pattern space and how demographics enable its semantic exploration. This work opens new perspectives in the semantic exploration of frequent patterns that reflect the behavior of different user communities.
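
    The sketch below illustrates the abstraction primitive under an assumed hand-crafted taxonomy: mapping items to their parent concepts collapses several concrete patterns into fewer, more readable abstract ones. The taxonomy, items, and function are hypothetical examples, not the paper's framework.

```python
# Hypothetical taxonomy: leaves map to parent concepts.
TAXONOMY = {
    "Mon": "weekday", "Tue": "weekday", "Sat": "weekend",
    "cafe": "food_place", "restaurant": "food_place",
}

def abstract(pattern, taxonomy):
    """Replace every item by its parent concept, deduplicating the result."""
    return frozenset(taxonomy.get(item, item) for item in pattern)

patterns = [{"Mon", "cafe"}, {"Tue", "restaurant"}, {"Sat", "cafe"}]
print({abstract(p, TAXONOMY) for p in patterns})
# Two abstract patterns remain in place of three concrete ones.
```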

    QuickFill, QuickMixte: Block Approaches for Reducing the Number of Programs in Program Synthesis

    Repetitive tasks are most often tedious; to facilitate their execution, program synthesis approaches have been developed. They consist in automatically inferring programs that satisfy a user's intent. The best-known program synthesis approach is FlashFill, integrated into the Excel spreadsheet, which processes character strings. In FlashFill, user intent is represented by examples, i.e., (input, output) pairs. FlashFill explores a very large space of programs and can therefore require a lot of execution time and infer many programs, some of which work on the given examples but do not capture the user's intent. In this article, we propose two block-based approaches, QuickMixte and QuickFill, which guide the exploration of FlashFill's program space by enriching the specifications provided by the user. These approaches ask the user to provide associations between subparts of the output and the input to refine the specifications. Experiments carried out on a series of 12 datasets show that QuickMixte and QuickFill considerably reduce FlashFill's program space. We show that with these approaches, it is often possible to give fewer examples than with the original FlashFill algorithm while obtaining a larger proportion of correct programs. Keywords: program synthesis, programming by example, string manipulation, repetitive tasks, block approaches.
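
    The sketch below shows, under stated assumptions, how block associations can enrich a programming-by-example specification: besides (input, output) pairs, the user declares that an output block must be copied from the input, which prunes candidates that merely hard-code the output. The program representation is invented for illustration and is not FlashFill's or QuickFill's internal language.

```python
def run(program, s):
    """Ops are ("const", text) or ("copy", i, j), the latter a slice of the input."""
    return "".join(op[1] if op[0] == "const" else s[op[1]:op[2]] for op in program)

def consistent(program, examples, copied_blocks):
    """Keep a candidate only if it reproduces every example AND builds each
    user-associated block with a copy operation rather than a constant."""
    if not all(run(program, i) == o for i, o in examples):
        return False
    for i, _ in examples:
        copies = {i[op[1]:op[2]] for op in program if op[0] == "copy"}
        if not all(block in copies for block in copied_blocks):
            return False
    return True

examples = [("John Smith", "J. Smith")]
candidates = [
    [("copy", 0, 1), ("const", ". "), ("copy", 5, 10)],  # builds output from input
    [("const", "J. Smith")],                             # hard-codes the output
]
# The association "Smith must be copied" eliminates the second candidate.
print(sum(consistent(p, examples, ["Smith"]) for p in candidates))  # 1
```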

    TAG: Learning Timed Automata from Logs

    Event logs are often one of the main sources of information for understanding the behavior of a system. While numerous approaches extract partial information from event logs, in this work we aim to infer a global model of a system from its event logs. We consider real-time systems, which can be modeled with Timed Automata: our approach is thus a Timed Automata learner. There is a handful of related works; however, they may require many parameters or produce Timed Automata that are either nondeterministic or lack precision. In contrast, our proposed approach, called TAG, requires only one parameter and learns a deterministic Timed Automaton with a good tradeoff between accuracy and complexity. This yields an interpretable and accurate global model of the real-time system under consideration. Our experiments compare our approach to related work and demonstrate its merits.
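
    As a rough illustration of the kind of model TAG learns, here is a sketch of a deterministic timed automaton with a single clock, where each transition carries an event label and a guard on the delay since the previous event. The automaton and its encoding are hypothetical, not TAG's output format.

```python
# (state, event) -> (guard_low, guard_high, next_state); determinism holds
# because each (state, event) pair has at most one transition.
transitions = {
    ("idle", "start"): (0.0, float("inf"), "busy"),
    ("busy", "done"):  (0.0, 5.0, "idle"),  # must finish within 5 time units
}

def accepts(timed_word, initial="idle"):
    """timed_word is a list of (event, delay-since-previous-event) pairs."""
    state = initial
    for event, delay in timed_word:
        if (state, event) not in transitions:
            return False
        low, high, nxt = transitions[(state, event)]
        if not low <= delay <= high:
            return False
        state = nxt
    return True

print(accepts([("start", 0.0), ("done", 3.2)]))  # True
print(accepts([("start", 0.0), ("done", 9.0)]))  # False: guard violated
```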

    VCNet: A self-explaining model for realistic counterfactual generation

    Counterfactual explanation is a common class of methods for making local explanations of machine learning decisions. For a given instance, these methods aim to find the smallest modification of feature values that changes the decision predicted by a machine learning model. One of the challenges of counterfactual explanation is the efficient generation of realistic counterfactuals. To address this challenge, we propose VCNet (Variational Counter Net), a model architecture that combines a predictor and a counterfactual generator that are jointly trained, for regression or classification tasks. VCNet is able both to generate predictions and to generate counterfactual explanations without having to solve another minimisation problem. Our contribution is the generation of counterfactuals that are close to the distribution of the predicted class. This is done by learning a variational autoencoder conditioned on the output of the predictor, in a joint-training fashion. We present an empirical evaluation on tabular datasets and across several interpretability metrics. The results are competitive with the state of the art.
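
    A schematic PyTorch sketch of the joint architecture described above: a predictor whose output conditions a variational autoencoder, trained with one combined loss, plus a counterfactual query that decodes under the desired class. Layer sizes, loss weights, and method names are placeholders, not VCNet's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VCNetSketch(nn.Module):
    def __init__(self, d_in, n_classes, d_latent=8):
        super().__init__()
        self.n_classes = n_classes
        self.predictor = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(),
                                       nn.Linear(32, n_classes))
        self.encoder = nn.Linear(d_in + n_classes, 2 * d_latent)  # -> mu, logvar
        self.decoder = nn.Sequential(nn.Linear(d_latent + n_classes, 32),
                                     nn.ReLU(), nn.Linear(32, d_in))

    def forward(self, x):
        logits = self.predictor(x)
        y_prob = F.softmax(logits, dim=-1)  # the VAE is conditioned on this
        mu, logvar = self.encoder(torch.cat([x, y_prob], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        x_rec = self.decoder(torch.cat([z, y_prob], -1))
        return logits, x_rec, mu, logvar

    @torch.no_grad()
    def counterfactual(self, x, target_class):
        # Decode x's latent code under the *desired* class condition, so the
        # counterfactual stays close to that class's data distribution.
        y_cf = F.one_hot(torch.tensor([target_class]), self.n_classes).float()
        mu, _ = self.encoder(torch.cat([x, y_cf], -1)).chunk(2, -1)
        return self.decoder(torch.cat([mu, y_cf], -1))

# One joint training step: prediction + reconstruction + KL, in a single loss.
model = VCNetSketch(d_in=4, n_classes=2)
x, y = torch.randn(16, 4), torch.randint(0, 2, (16,))
logits, x_rec, mu, logvar = model(x)
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
loss = F.cross_entropy(logits, y) + F.mse_loss(x_rec, x) + 0.1 * kl
loss.backward()
print(model.counterfactual(x[:1], target_class=1).shape)  # torch.Size([1, 4])
```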

    The Semantic Web in Support of the Execution Trace Analyst (Le web sémantique en aide à l'analyste de traces d'exécution)

    Execution trace analysis has become the tool of choice for debugging and optimizing application code on embedded systems. These systems have complex architectures based on integrated components called SoCs (Systems-on-Chip). The analyst's work (often done by an application developer) becomes a real challenge, because the traces produced by these systems are very large and the events they contain are low-level. We propose to support this analysis work by using knowledge management tools to ease the exploration of the trace. We propose a domain ontology that describes the main concepts and constraints for analyzing traces produced by SoCs. This ontology follows lightweight-ontology paradigms so that knowledge management scales. It relies on RDF triple-store technologies for its exploitation through declarative SPARQL queries. We illustrate our approach by providing a higher-quality analysis of the traces of a real use case.
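
    To illustrate the declarative exploration described above, the sketch below loads a tiny, hypothetical RDF description of trace events with rdflib and filters them with a SPARQL query. The ontology terms are invented for the example; the paper's SoC trace ontology is richer.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/trace#> .
ex:evt1 a ex:InterruptEvent ; ex:timestamp 12 ; ex:onCore ex:core0 .
ex:evt2 a ex:InterruptEvent ; ex:timestamp 47 ; ex:onCore ex:core1 .
""", format="turtle")

# Declarative question: which interrupt events happened on core 0, and when?
q = """
PREFIX ex: <http://example.org/trace#>
SELECT ?evt ?t WHERE {
    ?evt a ex:InterruptEvent ; ex:timestamp ?t ; ex:onCore ex:core0 .
}"""
for row in g.query(q):
    print(row.evt, row.t)
```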
    • 

    corecore